Abstract

We investigate training and using Gaussian kernel SVMs by approximating the kernel with an explicit finite-dimensional
polynomial feature representation based on the Taylor expansion of the exponential. Although not as
efficient as the recently-proposed random Fourier features [Rahimi and Recht, 2007] in terms of the number of features,
we show how this polynomial representation can provide a better approximation in terms of the computational
cost involved. This makes our "Taylor features" especially attractive for use on very large data sets, in conjunction
with online or stochastic training.



1 Introduction 

In recent years several extremely fast methods for training linear support vector machines have been developed. These
are generally stochastic (online) methods, which work on one example at a time, and for which each step involves only
simple calculations on a single feature vector: inner products and vector additions [Shalev-Shwartz et al., 2007; Hsieh
et al., 2008]. Such methods are capable of training support vector machines (SVMs) with many millions of examples
in a few seconds on a conventional CPU, essentially eliminating any concerns about training runtime even on very
large datasets.

Meanwhile, fast methods for training kernelized SVMs have lagged behind. State-of-the-art kernel SVM training 
methods may take days or even weeks of conventional CPU time for problems with a million examples of effective 
dimension less than 100. While the stochastic methods mentioned above can indeed be kernelized, each iteration then 
requires the computation of an entire row of the kernel matrix, i.e. the entire data set needs to be considered in each 
stochastic step. 

Any Mercer kernel implements an inner-product between a mapping of two input vectors into a high dimensional 
feature space. In this paper we propose an explicit low-dimensional approximation to this mapping, which, after being 
applied to the input data, can be used with an efficient linear SVM solver. The dimension of the approximate map- 
ping controls the computational difficulty and the approximation qualities. The key to choosing a good approximate 
mapping comes in trading off these considerations. 

Rahimi and Recht [2007] proposed such a feature representation for the Gaussian kernel (as well as other shift-invariant
kernels) using random "Fourier" features: each feature (each coordinate in the feature mapping) is a cosine of a random 
affine projection of the data. 

In this paper we study an alternative simple feature representation approximating the Gaussian kernel: we take a low- 
order Taylor expansion of the exponential, resulting in features that are scaled monomials in the coordinates of the 
input vectors. We focus on the Gaussian kernel, but a similar approach could also work for other kernels which depend
on distances or inner products between feature vectors, e.g. the sigmoid kernel. 

At first glance it seems that this Taylor feature representation must be inferior to random Fourier features. The theo- 
retical guarantee on the approximation quality is given by the error of a Taylor series, and is expressed most naturally 
in terms of the degree of the expansion, of which the number of features is an exponential function. Indeed, to achieve 
the same approximation quality, we need many more Taylor than random Fourier features (see Section 4 for a detailed
analysis). Furthermore, the Taylor features are not shift and rotation invariant, even though the Gaussian kernel itself 
is of course shift and rotation invariant. 

However, we argue that when choosing an explicit feature representation, one should focus not on the number of 
features used by the representation, but rather on the computational cost of computing it. In online (or stochastic) 
optimization, each example is considered only once, or perhaps a few times, and the cost to the SVM optimizer 
of each step is essentially just the cost of reading the feature vector. Even if each training example is considered 
several times, the dataset will often be sufficiently large that precomputing and saving all feature vectors is infeasible. 
For example, consider a data set of hundreds of millions of examples, and an explicit feature mapping with 100,000 
features. Although it might be possible to store the input representation in memory, it would require tens of terabytes to 
store the feature vectors. Instead, one will need to re-compute each feature vector when required. The computational
cost of training is then dominated by that of computing the features, and we should judge the utility of a feature
mapping not by the approximation quality as a function of dimensionality, but rather as a function of computational 
cost. 

We will discuss how the cost of computing the Taylor features can be dramatically less than that of the random 
Fourier features, especially for sparse input data. In fact, the advantage of the Taylor features over the random Fourier 
features for sparse data is directly related to the Taylor features not being rotationally and shift invariant, as these 
operations do not preserve sparsity. We demonstrate empirically that on many benchmark datasets, although the Taylor 
representation requires many more features to achieve the same approximation quality as random Fourier features, it 
nevertheless outperforms random Fourier features in terms of approximation and prediction quality as a function of
the computational cost. 



Related Work Fine and Scheinberg [2002] and Balcan et al. [2006] suggest obtaining a low-dimensional approximation
to an arbitrary kernel by approximating the empirical Gram matrix. Such approaches invariably involve calculating
a factorization of (at least a large subset of) the Gram matrix, an operation well beyond reach for large data sets. Here, 
we use an efficient non-data-dependent approximation that relies on analytic properties of the Gaussian kernel. 



A similar approximation of the Gaussian kernel by a low-dimensional Taylor expansion was proposed by Yang et al. 
[2006], who used this approximation to speed up a conjugate gradient optimizer. Xu et al. [2004] also proposed the
use of the Taylor expansion to explicitly approximate the Hilbert space induced by the Gaussian kernel, but presented 
neither experiments nor a quantitative discussion of the approximation. We are not aware of any comparison of the 
Fourier features with the Taylor features, beyond a passing mention by Rahimi and Recht [2008] that the number of
Taylor features required for good approximation grows rapidly. In particular, we are not aware of a previous analysis 
taking into account the computational cost of generating the features, which is an important issue that, as we discuss 
here, changes the picture entirely. 



2 Kernel Projections and Approximations 

Consider a classifier based on a predictor $f : \mathcal{X} \to \mathbb{R}$, which is trained by minimizing the regularized training error
on a training set of examples $S = \{(x_i, y_i)\}_{i=1}^m$, where $x_i \in \mathcal{X}$ and $y_i \in \mathcal{Y}$. Here we take $\mathcal{Y} = \{\pm 1\}$, and minimize the
hinge loss, although our approach holds for other loss functions, including multiclass and structured loss.

The "kernel trick" is a popular strategy which permits using linear predictors in some implicit Hilbert space $\mathcal{H}$, i.e.
predictors of the form $f(x) = \langle w, \phi(x) \rangle$, where $\|w\|_{\mathcal{H}}$ is regularized, and $\phi : \mathcal{X} \to \mathcal{H}$ is given implicitly in terms
of a kernel function $K(x,x') = \langle \phi(x), \phi(x') \rangle$. The Representer Theorem guarantees that the predictor minimizing the
regularized training error is of the form:

$$f(x) = \sum_{i=1}^{m} \alpha_i \langle \phi(x_i), \phi(x) \rangle = \sum_{i=1}^{m} \alpha_i K(x_i, x) \tag{1}$$

for some set of coefficients $\alpha_i \in \mathbb{R}$. It suffices then, to search over the coefficients $\alpha_i \in \mathbb{R}$ when training. However,
when the size of the training set, $m$, is very large, it can be very expensive to evaluate (1) for even a single $x$. For
example, for a $d$ dimensional input space $\mathcal{X} = \mathbb{R}^d$, and with a kernel whose evaluation runtime is even just linear in $d$
(e.g. the Gaussian kernel, as well as most other simple kernels), evaluating (1) requires $O(d \cdot m)$ operations. The goal
of this paper is to study an explicit finite dimensional approximation $\tilde\phi : \mathbb{R}^d \to \mathbb{R}^D$ to the mapping $\phi$, which alleviates
the need to use the representation (1). We will then consider classifiers $\tilde f$ of the form:

$$\tilde f(x) = \langle \tilde w, \tilde\phi(x) \rangle$$

where $\tilde w \in \mathbb{R}^D$ is a weight vector which we represent explicitly. Evaluating $\tilde f(x)$ requires $O(D)$ operations, which is
better than the representation (1) when $D \ll d \cdot m$.
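The difference between the two representations can be sketched in a few lines of Python. The feature map `phi_tilde` below is a hypothetical toy map chosen only for illustration (it is not the Taylor map of Section 3); the point is that once the coefficients are collapsed into an explicit weight vector, prediction no longer touches the training set.

```python
def phi_tilde(x):
    # Hypothetical explicit feature map, for illustration only:
    # a constant feature followed by the raw coordinates (so D = d + 1).
    return [1.0] + list(x)

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def predict_expansion(alphas, train_x, x):
    # Form (1) with the approximate kernel <phi_tilde(x_i), phi_tilde(x)>:
    # one kernel evaluation per training point.
    return sum(a * dot(phi_tilde(xi), phi_tilde(x))
               for a, xi in zip(alphas, train_x))

def collapse_weights(alphas, train_x):
    # w = sum_i alpha_i * phi_tilde(x_i), computed once.
    w = [0.0] * len(phi_tilde(train_x[0]))
    for a, xi in zip(alphas, train_x):
        for j, f in enumerate(phi_tilde(xi)):
            w[j] += a * f
    return w

def predict_explicit(w, x):
    # A single inner product in R^D: O(D) operations.
    return dot(w, phi_tilde(x))
```

Both predictors return the same value; only the cost per prediction differs.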

One option for constructing such an approximation is to project the mapping $\phi$ onto a $D$-dimensional subspace of $\mathcal{H}$:
$\tilde\phi(x) = P\phi(x)$. This raises the question of how one may most effectively reduce the dimensionality of the subspace
within which we work, while minimizing the resulting approximation error. Our first result will bound the error which
results from solving the SVM problem on a subspace of $\mathcal{H}$.

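Consider the kernel Support Vector Machines (SVM) optimization problem (using $(\cdot)_+$ to denote $\max\{0, \cdot\}$):

$$\min_w \; p(w) = \frac{\lambda}{2} \|w\|_{\mathcal{H}}^2 + \frac{1}{m} \sum_{i=1}^{m} \left(1 - y_i \langle w, \phi(x_i) \rangle\right)_+ \tag{2}$$

and denote by $\tilde p(w)$ the objective function which results from replacing the mapping $\phi$ with the approximate mapping
$\tilde\phi$. Recall also that $K(x,x') = \langle \phi(x), \phi(x') \rangle$ and denote $\tilde K(x,x') = \langle \tilde\phi(x), \tilde\phi(x') \rangle$.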

Theorem 1. Let $p^* = \inf_w p(w)$ be the optimum value of (2). For any approximate mapping $\tilde\phi(x) = P\phi(x)$ defined by
a projection $P$, let $\tilde p^* = \inf_w \tilde p(w)$ be the optimum value of the SVM with respect to this feature mapping. Then:

$$p^* \le \tilde p^* \le p^* + \frac{1}{m\sqrt{\lambda}} \sum_{i=1}^{m} \sqrt{K(x_i,x_i) - \tilde K(x_i,x_i)}$$

Note that since we also have $\|\tilde\phi(x)\| \le \|\phi(x)\|$, it is meaningful to compare the objective values of the SVM.



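Proof. For any $w$, we will have that $p(Pw) \le \tilde p(w)$ since $\|Pw\|_{\mathcal{H}} \le \|w\|_{\mathcal{H}}$, while the loss term will be identical. This
implies that $p^* \le \tilde p^*$. For the second part of the inequality, note that:

$$\left| \left(1 - y_i \langle w, \phi(x_i) \rangle\right)_+ - \left(1 - y_i \langle w, P\phi(x_i) \rangle\right)_+ \right| \le \left| \langle w, \phi(x_i) - P\phi(x_i) \rangle \right| \le \|w\|_{\mathcal{H}} \, \|P^{\perp}\phi(x_i)\|_{\mathcal{H}}$$

which implies that $\tilde p(w) \le p(w) + \frac{1}{m} \|w\|_{\mathcal{H}} \sum_{i=1}^{m} \|P^{\perp}\phi(x_i)\|_{\mathcal{H}}$ for any $w$, and in particular for $w^*$ the optimum of
$p(w)$. This, combined with $\|w^*\|_{\mathcal{H}} \le 1/\sqrt{\lambda}$ [Shalev-Shwartz et al., 2007], yields:

$$\tilde p^* \le p^* + \frac{1}{m\sqrt{\lambda}} \sum_{i=1}^{m} \|P^{\perp}\phi(x_i)\|_{\mathcal{H}} = p^* + \frac{1}{m\sqrt{\lambda}} \sum_{i=1}^{m} \sqrt{K(x_i,x_i) - \tilde K(x_i,x_i)}$$

□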
Theorem 1 suggests using a low-dimensional projection minimizing $\sum_{i=1}^{m} \sqrt{K(x_i,x_i) - \tilde K(x_i,x_i)} = \sum_{i=1}^{m} \|\phi(x_i) - P\phi(x_i)\|$.
That is, that one should choose a subspace of $\mathcal{H}$ with small average distances to the data (not squared distances as in
PCA). The Taylor approximation we suggest is such a projection, albeit not the optimal one, so we can apply Theorem
1 to analyze its approximation properties.



Approximating with Random Features A different option for approximating the mapping $\phi$ for a radial kernel of
the form $K(x,x') = K(x - x')$, was proposed by Rahimi and Recht [2007]. They proposed mapping the input data to
a randomized low-dimensional feature space as follows. Let $\hat K(\omega)$ be the real-valued Fourier transform of the kernel
$K(x - x')$, namely

$$K(x - x') = \int \hat K(\omega) \cos\left(\omega \cdot (x - x')\right) d\omega \tag{3}$$

Bochner's theorem ensures that if $K(x - x')$ is properly scaled, then $\hat K(\omega)$ is a proper probability distribution. Hence:



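$$K(x - x') = E_{\omega \sim \hat K(\omega)}\left[\cos\left(\omega \cdot (x - x')\right)\right] = E_{\omega \sim \hat K(\omega),\, \theta \sim U[0,2\pi]}\left[2 \cos(\omega \cdot x + \theta) \cdot \cos(\omega \cdot x' + \theta)\right]$$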



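The kernel function can then be approximated by independently drawing $\omega_1, \ldots, \omega_D \in \mathbb{R}^d$ from the distribution $\hat K(\omega)$
and $\theta_1, \ldots, \theta_D$ uniformly from $[0, 2\pi]$, and using the explicit feature mapping:

$$\tilde\phi_j(x) = \cos(\omega_j \cdot x + \theta_j) \tag{4}$$

with each feature scaled by $\sqrt{2/D}$, so that the inner product of two feature vectors is an empirical average of the expectation above.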



In the case of the Gaussian kernel, $K(x - x') = \exp\left(-\|x - x'\|^2 / 2\sigma^2\right)$, and $\hat K(\omega) = (2\pi/\sigma^2)^{-d/2} \exp\left(-\sigma^2 \|\omega\|^2 / 2\right)$
defines a Gaussian distribution (with covariance $\sigma^{-2} I$), from which it is easy to draw i.i.d. samples.
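As a rough illustration of this construction, the following Python sketch draws the random parameters and evaluates the feature map for the Gaussian kernel. It assumes that, for bandwidth $\sigma$, each coordinate of $\omega_j$ is drawn from a zero-mean Gaussian with standard deviation $1/\sigma$ (the scaling under which the expectation of $\cos(\omega \cdot (x - x'))$ equals the kernel), and it includes the $\sqrt{2/D}$ normalization explicitly. The function names are illustrative, not taken from Rahimi and Recht's code.

```python
import math
import random

def sample_fourier_params(d, D, sigma, rng):
    # Assumption: omega_j ~ N(0, I / sigma^2), theta_j ~ U[0, 2*pi].
    omegas = [[rng.gauss(0.0, 1.0 / sigma) for _ in range(d)] for _ in range(D)]
    thetas = [rng.uniform(0.0, 2.0 * math.pi) for _ in range(D)]
    return omegas, thetas

def fourier_features(x, omegas, thetas):
    # Feature j is sqrt(2/D) * cos(omega_j . x + theta_j).
    scale = math.sqrt(2.0 / len(omegas))
    return [scale * math.cos(sum(w_i * x_i for w_i, x_i in zip(w, x)) + t)
            for w, t in zip(omegas, thetas)]

def gaussian_kernel(x, y, sigma):
    sq = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq / (2.0 * sigma ** 2))
```

The inner product of `fourier_features(x, ...)` and `fourier_features(y, ...)` is a Monte Carlo estimate of `gaussian_kernel(x, y, sigma)` whose error shrinks roughly as $1/\sqrt{D}$.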

The following guarantee was provided on the convergence of kernel values $\tilde K(x,x') = \langle \tilde\phi(x), \tilde\phi(x') \rangle$ corresponding to
the random Fourier feature mapping:

Claim 1 (Rahimi and Recht, Claim 1). Let $\tilde K$ be the kernel defined by $D$ random Fourier features, and $R$ be the radius
(in the input space) of the training set, then for any $\varepsilon > 0$:



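$$\Pr\left[ \sup_{\|x\|, \|y\| \le R} \left| \tilde K(x,y) - K(x,y) \right| \ge \varepsilon \right] \le 2^8 \left( \frac{\sqrt{d}\, R}{\sigma \varepsilon} \right)^2 \exp\left( -\frac{D \varepsilon^2}{4(2+d)} \right) \tag{5}$$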



It is also worth mentioning that the random Fourier features are invariant to translations and rotations, as is the kernel 
itself. However, due to the fact that each feature corresponds to an independent random projection, a collection of such
features will not, in general, be an orthogonal projection, implying that Theorem 1 does not apply.



3 Taylor Features 

In this section we present an alternative approximation of the Gaussian kernel, which will be obtained by a projection
onto a subspace of $\mathcal{H}$. The idea is to use the Taylor series expansion of the Gaussian kernel function with respect to
$\langle x, x' \rangle$, where each term in the Taylor series can then be expressed as a sum of matching monomials in $x$ and $x'$. More
specifically, we express the Gaussian kernel as:

$$K(x,x') = e^{-\frac{\|x - x'\|^2}{2\sigma^2}} = e^{-\frac{\|x\|^2}{2\sigma^2}} \, e^{-\frac{\|x'\|^2}{2\sigma^2}} \, e^{\frac{\langle x, x' \rangle}{\sigma^2}} \tag{6}$$

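The first two factors depend on $x$ and $x'$ separately, so we focus on the third factor. The term $z = \langle x, x' \rangle / \sigma^2$ is a real
number, and using the (scalar) Taylor expansion of $e^z$ around $z = 0$ we have:

$$e^{\frac{\langle x, x' \rangle}{\sigma^2}} = \sum_{k=0}^{\infty} \frac{1}{k!} \left( \frac{\langle x, x' \rangle}{\sigma^2} \right)^k \tag{7}$$

We now expand:

$$\langle x, x' \rangle^k = \left( \sum_{i=1}^{d} x_i x'_i \right)^k = \sum_{j \in [d]^k} \left( \prod_{i=1}^{k} x_{j_i} \right) \left( \prod_{i=1}^{k} x'_{j_i} \right) \tag{8}$$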

where $j$ enumerates over all selections of $k$ coordinates of $x$ (for simplicity of presentation, we allow repetitions and
enumerate over different orderings of the same coordinates, thus avoiding explicitly writing down the multinomial
coefficients). We can think of (8) as an inner product between degree $k$ monomials of the coordinates of $x$ and $x'$.
Plugging this back into (7) and (6) results in the following explicit feature representation for the Gaussian kernel:

$$\phi_{k,j}(x) = e^{-\frac{\|x\|^2}{2\sigma^2}} \frac{1}{\sigma^k \sqrt{k!}} \prod_{i=1}^{k} x_{j_i} \tag{9}$$



with $K(x,x') = \langle \phi(x), \phi(x') \rangle = \sum_{k=0}^{\infty} \sum_{j \in [d]^k} \phi_{k,j}(x) \phi_{k,j}(x')$. Now, for our approximate feature space, we project onto
the coordinates of $\phi(\cdot)$ corresponding to $k \le r$, for some degree $r$. That is, we take $\tilde\phi_{k,j}(x) = \phi_{k,j}(x)$ for $k \le r$. This
corresponds to truncating the Taylor expansion (7) after the $r$th term.
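The construction can be checked numerically with a short Python sketch. It enumerates every index tuple $j \in [d]^k$ with repetitions, exactly as in the presentation above (so it is wasteful and only suitable for tiny $d$ and $r$), and compares the inner product of the resulting feature vectors with the truncated series. Function names are illustrative.

```python
import itertools
import math

def taylor_features(x, sigma2, r):
    # Features (9) for every degree k <= r and every index tuple j in [d]^k.
    d = len(x)
    lead = math.exp(-sum(v * v for v in x) / (2.0 * sigma2))
    feats = []
    for k in range(r + 1):
        # 1 / (sigma^k * sqrt(k!)) written in terms of sigma^2.
        coef = lead / math.sqrt(sigma2 ** k * math.factorial(k))
        for j in itertools.product(range(d), repeat=k):
            feats.append(coef * math.prod(x[i] for i in j))
    return feats

def truncated_kernel(x, y, sigma2, r):
    # Gaussian kernel with the exponential series cut after the degree-r term.
    z = sum(a * b for a, b in zip(x, y)) / sigma2
    lead = math.exp(-(sum(a * a for a in x) + sum(b * b for b in y)) / (2.0 * sigma2))
    return lead * sum(z ** k / math.factorial(k) for k in range(r + 1))
```

For any pair of inputs, `sum(a * b for a, b in zip(taylor_features(x, s2, r), taylor_features(y, s2, r)))` agrees with `truncated_kernel(x, y, s2, r)` up to floating-point rounding.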

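We would like to bound the error introduced by this approximation, i.e. bound $|K(x,x') - \tilde K(x,x')|$ where:

$$\tilde K(x,x') = \langle \tilde\phi(x), \tilde\phi(x') \rangle = e^{-\frac{\|x\|^2 + \|x'\|^2}{2\sigma^2}} \sum_{k=0}^{r} \frac{1}{k!} \left( \frac{\langle x, x' \rangle}{\sigma^2} \right)^k \tag{10}$$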



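The difference $|K(x,x') - \tilde K(x,x')|$ is given (up to the scaling by the leading factor) by the higher order terms of
the Taylor expansion of $e^z$, which by Taylor's theorem are bounded by $\frac{|z|^{r+1}}{(r+1)!} e^{\alpha}$ for some $|\alpha| \le |z|$. We may bound
$|\alpha| \le |\langle x, x' \rangle| / \sigma^2$ and $|\langle x, x' \rangle| \le \|x\| \|x'\|$, obtaining:

$$\left| K(x,x') - \tilde K(x,x') \right| \le e^{-\frac{\|x\|^2 + \|x'\|^2}{2\sigma^2}} \frac{1}{(r+1)!} \left( \frac{|\langle x, x' \rangle|}{\sigma^2} \right)^{r+1} e^{\alpha} \le \frac{1}{(r+1)!} \left( \frac{\|x\| \|x'\|}{\sigma^2} \right)^{r+1} \tag{11}$$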
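A small numerical check of (11) may help make the bound concrete. The Python sketch below evaluates the exact Gaussian kernel, its degree-$r$ truncation (10), and the right-hand side of (11) in the form $\frac{1}{(r+1)!} (\|x\| \|x'\| / \sigma^2)^{r+1}$; the helper names are illustrative.

```python
import math

def gaussian_kernel(x, y, sigma2):
    sq = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq / (2.0 * sigma2))

def truncated_kernel(x, y, sigma2, r):
    # Equation (10): exponential series cut after the degree-r term.
    z = sum(a * b for a, b in zip(x, y)) / sigma2
    lead = math.exp(-(sum(a * a for a in x) + sum(b * b for b in y)) / (2.0 * sigma2))
    return lead * sum(z ** k / math.factorial(k) for k in range(r + 1))

def taylor_error_bound(x, y, sigma2, r):
    # Right-hand side of (11).
    nx = math.sqrt(sum(a * a for a in x))
    ny = math.sqrt(sum(b * b for b in y))
    return (nx * ny / sigma2) ** (r + 1) / math.factorial(r + 1)
```

The bound depends on the data only through $\|x\| \|x'\| / \sigma^2$, so it tightens quickly with $r$ once $r + 1$ exceeds that ratio and is loose otherwise.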



As for the dimensionality $D$ of $\tilde\phi(\cdot)$ (i.e. the number of features of degree not more than $r$), as presented we have
$d^k$ features of degree $k$. But this ignores the fact that many features are just duplicates resulting from different permutations
of $j$. Collecting these into a single feature for each distinct monomial (with the appropriate multinomial
coefficient), we have $\binom{d+k-1}{k}$ features of degree $k$, and a total of $D = \binom{d+r}{r}$ features of degree at most $r$.
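These counts are easy to verify by brute force. The following Python sketch compares the closed forms with a direct enumeration of distinct monomials (multisets of coordinates); the function names are illustrative.

```python
import itertools
from math import comb

def monomials_of_degree(d, k):
    # Closed form: number of distinct degree-k monomials in d variables.
    return comb(d + k - 1, k)

def monomials_up_to_degree(d, r):
    # Closed form: number of distinct monomials of degree at most r.
    return comb(d + r, r)

def enumerate_monomials(d, k):
    # Each distinct monomial corresponds to one multiset of k coordinates.
    return list(itertools.combinations_with_replacement(range(d), k))
```

For comparison, the uncollected representation has $d^k$ entries at degree $k$, so the saving from merging permutations approaches a factor of $k!$ when $d$ is much larger than $k$.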



4 Theoretical Comparison of Taylor and Random Fourier Features 

We now compare the error bound of the Taylor features given in (11) to the probabilistic bound of the random Fourier
features given in (5).

We first note that each Taylor feature may be calculated in constant time, because each degree-$k$ feature may be
derived from a degree-$(k-1)$ feature by multiplying it by a constant times an element of $x$. In fact, because each feature
is proportional to a product of elements of $x$, on sparse datasets, the Taylor features will themselves be highly sparse,
enabling one to entirely avoid calculating many features. For a vector $x$ with $\tilde d$ nonzeros, one may verify that there
will be $\binom{\tilde d + r}{r} = O(\tilde d^r)$ nonzero features of degree at most $r$, which can all be computed in overall time $O(\tilde d^r)$.

In contrast, computing each Fourier feature requires $O(\tilde d)$ time on a vector with $\tilde d$ nonzeros, yielding an overall time
of $O(D \cdot \tilde d)$ to compute $D$ random Fourier features.
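The incremental computation on sparse inputs can be sketched as follows. This is one possible implementation under stated assumptions, not the authors' code: the input is a hypothetical `dict` mapping coordinate index to nonzero value, each distinct monomial is stored once under its sorted index tuple, and the multinomial coefficient is folded into the feature value (which reduces the per-monomial factor to $1/\sqrt{\prod_i c_i!}$, where $c_i$ is the multiplicity of coordinate $i$).

```python
import math

def sparse_taylor_features(x_sparse, sigma2, r):
    # Returns {sorted index tuple: feature value} for all nonzero features
    # of degree at most r. Each new feature costs one multiplication.
    sigma = math.sqrt(sigma2)
    idx = sorted(x_sparse)
    lead = math.exp(-sum(v * v for v in x_sparse.values()) / (2.0 * sigma2))
    feats = {(): lead}
    # Entries: (monomial, value, position of its last index, multiplicity of that index).
    level = [((), lead, 0, 0)]
    for _ in range(r):
        next_level = []
        for mono, val, pos, mult in level:
            # Only append indices >= the last one, so each monomial appears once.
            for p in range(pos, len(idx)):
                new_mult = mult + 1 if (mono and p == pos) else 1
                new_val = val * x_sparse[idx[p]] / (sigma * math.sqrt(new_mult))
                new_mono = mono + (idx[p],)
                feats[new_mono] = new_val
                next_level.append((new_mono, new_val, p, new_mult))
        level = next_level
    return feats

def sparse_dot(fa, fb):
    # Inner product over the monomials present in both feature maps.
    if len(fb) < len(fa):
        fa, fb = fb, fa
    return sum(v * fb[k] for k, v in fa.items() if k in fb)
```

Monomials involving any zero coordinate are never generated, which is where the $O(\tilde d^r)$ cost comes from.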

With this in mind, we will define B as a "budget" of operations, and will take as many features as may be computed 
within this budget, assuming that each nonzero Taylor feature may be calculated in one operation, and each Fourier 
Table 1: Datasets used in our experiments. The "Dim" and "NZ" columns contain the total number of elements in each
training/testing vector, and the average number of nonzero elements, respectively.

        Dataset                                  Kernel SVM                      Linear SVM
Name    Train size  Test size  Dim   NZ      C   sigma^2  Test error      C   Test error
Adult   32562       16282      123   13.9    1   40       14.9%           8   15.0%
Cov1    522911      58101      54    12      3   0.125    6.2%            4   22.7%
MNIST   60000       10000      768   150     1   100      0.57%           2   5.2%
TIMIT   63881       22257      39    39      1   80       11.5%           4   22.7%


feature in $\tilde d$. Setting $\delta = \Pr\left[ |\tilde K(x,y) - K(x,y)| \ge \varepsilon \right]$ and solving (5) for $\varepsilon$, with $D = B/\tilde d$, yields that with probability
$1 - \delta$, for the Fourier features:



|^(V)-lM)|«ofy'^log^|^^j (12) 

For the Taylor features, B = ( d ^ r ) implies that r + 1 <J Applying Stirling's approximation to ( fTT| yields: 

\K( X , X ')-K( X , X ')\ *0 (J^B^M^S)) 

Neither of the above bounds clearly dominates the other. The main advantage of the Taylor approximation, also seen 
in the above bounds, is that its performance only depends on the number of non-zero input dimensions $\tilde d$, unlike the
Fourier random features which have a cost which scales quadratically with the dimension, and even for sparse data
will depend (linearly) on the overall dimensionality. The computational budget required for the Taylor approximation
is polynomial in the number of non-zero dimensions, but exponential in the effective radius ($R/\sigma$). Once the budget
is high enough, however, these features can yield a polynomial decrease in the approximation error. This suggests that
Taylor approximation is particularly appropriate for sparse (potentially high-dimensional) data sets with a moderate 
number of non-zeros, and where the kernel bandwidth is on the same order as the radius of the data (as is often the 
case). 
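To make the budget comparison concrete, the following Python sketch computes what a budget of $B$ operations buys under the cost model of this section (one operation per nonzero Taylor feature, one operation per nonzero input coordinate for each Fourier feature). The function names are illustrative and a budget of at least one operation is assumed.

```python
from math import comb

def max_taylor_degree(nnz, budget):
    # Largest r such that the C(nnz + r, r) nonzero features fit in the budget.
    r = 0
    while comb(nnz + r + 1, r + 1) <= budget:
        r += 1
    return r

def num_fourier_features(nnz, budget):
    # Each Fourier feature costs one operation per nonzero input coordinate.
    return budget // nnz
```

For example, with 12 nonzeros per example and a budget of 2000 operations, this cost model allows Taylor features up to degree 4 (1820 features) or 166 Fourier features.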




5 Experiments 

In this section, we describe an empirical comparison of the random Fourier features of Rahimi and Recht and the 
polynomial Taylor-based features described in Section 3. The question we ask is: which explicit feature representation
provides a better approximation to the Gaussian kernel, with a fixed computational budget? 

Experiments were performed on four datasets, summarized in Table 1. Adult and MNIST were downloaded from
Leon Bottou's LaSVM web page. They, along with Blackard and Dean's forest covertype-1 dataset (available in the 
UCI machine learning repository), are well-known SVM benchmark datasets. TIMIT is a phonetically transcribed 
speech corpus, of which we use a subset for framewise classification of the stop consonants. From each 10 ms frame 
of speech, we extracted MFCC features and their first and second derivatives. Both MNIST and TIMIT are multiclass 
classification problems, which we converted into binary problems by performing one-versus-rest, with the digit 8 and 
phoneme /k/ being the positive classes, respectively. The regularization and Gaussian kernel parameters for Adult,
MNIST and Cov1 are taken from Shalev-Shwartz et al. [2010], and are in turn based on those of Platt [1998] and
Bordes et al. [2005]. The parameters for TIMIT were found by optimizing the test error on a held-out validation set.



Of these datasets, all except MNIST are fairly low-dimensional, and all except TIMIT are sparse. To get a rough sense 
[Figure 1 plots. Panels: Adult, MNIST, TIMIT. Horizontal axis: computational cost.]


Figure 1: Primal objectives and testing classification errors for various numbers of Fourier and Taylor features. For 
the Fourier features, the markers correspond to numbers of features which are powers of two, starting at 32. For Taylor, 
each marker corresponds to a degree, starting at 2. The cost of calculating $\tilde\phi$, in units of floating point operations, is
displayed on the horizontal axis. The solid black lines are the primal objective function value and testing classification 
error achieved by the optimal solution to the Gaussian kernel SVM problem, while the dashed lines in the bottom plots 
are the testing classification error achieved by a linear SVM. 

of the benefit of the Gaussian kernel for these data sets, we also include in Table 1 the best test error obtained using a
linear kernel over Cs in the range 

For each of the data sets, we compared the value of the (primal) SVM objective and the classification performance 
(on the test set) achieved using varying numbers of Taylor and Fourier features. Results are reported in Figure 1
and in the left column of Figure 2. We report results in units of the number of floating-point operations required to
calculate each feature vector, taking into account sparsity, as discussed in Section 4. As was discussed earlier, this
is the dominant cost in stochastic optimization methods such as Pegasos [Shalev-Shwartz et al., 2007] and stochastic
dual coordinate ascent [Hsieh et al., 2008], which are the fastest methods of training large-scale linear SVMs. We used
a fairly optimized SGD implementation into which the explicit feature vector calculations were integrated. Our actual 
runtimes are indeed in line with the calculated computational costs (we prefer reporting the theoretical cost as it is not 
implementation or architecture dependent). 

As can be seen from Figures 1 and 2, the computational cost required to obtain the same SVM performance is typically
lower when using the Taylor features than the Fourier features, despite the exponential growth of the number of features 
as a function of the degree. The exception is the MNIST dataset, which has a fairly high number (over 150) of non-zero 
dimensions per data point, yielding an extremely sharp increase in the computational costs of higher-degree Taylor 
feature expansions. 

To better appreciate the difference between the dependence on the number of features and that on the computational 
cost we include more detailed results for the Cov1 dataset, in Figure 2. Here we again plot the value of the SVM
objective, this time both as function of the number of features and as a function of the computational cost. As expected, 
the Fourier features perform much better as a function of the number of features, but, as argued earlier, we should be 
[Figure 2 plots. Dataset: Cov1. Horizontal axes: computational cost, # features.]


Figure 2: Left column: same as Figure 1. Top middle: primal objective as a function of the total number of features (in
log scale). Bottom middle: test error as a function of the Gaussian kernel parameter $\sigma^2$, for Taylor and random Fourier
expansions of the same computational cost, compared with the true Gaussian kernel. The dot-dashed line indicates
the performance of a 1-NN classifier, trained using the ANN library [Arya and Mount, 1993; Mount and Arya, 2006;
Bagon, 2009]. Right column: average value of the approximation error $|K(x,x') - \tilde K(x,x')|$ over 100000 randomly
chosen pairs of training vectors, in terms of both computational cost and total number of features.



more concerned with the cost of calculating them. In order to directly measure how well each feature representation 
approximates the Gaussian kernel, we also include in Figure 2 a comparison of the average approximation error.

Next, we consider the effect of the bandwidth parameter $\sigma$ on the Taylor and Fourier approximations. Note that the
theoretical analysis for both methods deteriorates significantly when the bandwidth decreases. This is verified in the
bottom-middle plot of Figure 2, which shows that the (test) classification error of the two approximations (with the
same fixed computational budget) deteriorates, relative to that of the true Gaussian SVM, as the bandwidth decreases.
This deterioration can be observed on other data sets as well. On Cov1, the deterioration is so strong that even though
the generalization performance of the true Gaussian kernel SVM keeps improving as the bandwidth decreases, the
test errors for both approximations actually start increasing fairly quickly. It should be noted that Cov1 is atypical in
this regard: nearest-neighbor classification achieves almost optimal results on this dataset (the dot-dashed line in the
bottom-middle plot of Figure 2), and so decreasing the bandwidth, which approximates the nearest-neighbor classifier,
is beneficial. In contrast, on the data sets in Figure 1, the optimal bandwidth for the Gaussian kernel is large enough
to allow good approximation by the Fourier and Taylor approximations.

Finally, in order to get some perspective on the real-world benefit of the Taylor features, we also report actual runtimes 
for a large scale realistic example. We compared training times for the Gaussian kernel and the Taylor features, on 
the full TIMIT dataset, where the goal was framewise phoneme classification, i.e., given a 10 ms frame of speech 
the goal is to predict the uttered phoneme from a set of 39 phoneme symbols. We used the standard split of the 
dataset to training, validation and test sets, and extracted MFCC features. With this set of acoustic features the common
practice is to use the Gaussian kernel. Its bandwidth was selected on the validation set to be $\sigma^2 = 19$. The training
set includes 1.1 million examples, and existing SVM libraries such as SVMLIB or SVMLight failed to converge in a 
Table 2: Comparison of the Gaussian kernel and its Taylor approximation to the polynomial kernel $K(x,y) = (\langle x, y \rangle + 1)^d$,
after scaling the data to have unit average squared norm. Here, $d$ is the degree of the polynomial.
The reported test errors are the minima over parameter choices taken from a coarse power-of-two based grid, within
which the reported parameters are well inside the interior. Gaussian kernel SVMs were optimized using our GPU
optimizer, while the others were optimized by running Pegasos for 100 epochs.

          Gaussian                        Taylor                                  Polynomial
Dataset   C    sigma^2   Test error      degree  C     sigma^2  Test error      degree  C    Test error
Adult     4    100       14.9%           4       8     200      14.7%           4       4    14.8%
MNIST     8    100       0.42%           2       2048  200      0.54%           2       256  0.58%
TIMIT     2    40        10.8%           3       8     200      11.4%           3       64   11.6%
Cov1      16   0.03125   3.3%            4       128   0.5      12.3%           4       512  13.6%


reasonable amount of time (see the training time in Salomon et al. [2002]). Using our own implementation with the
exact Gaussian kernel and stochastic dual coordinate ascent, the training took 313 hours (almost two weeks) on a 2GHz
Intel Core 2 (using one core). Using the same implementation with the kernel function replaced by its degree-3 Taylor
approximation, the training took only 53 hours. The results were almost the same: multiclass accuracy of 69.6% for
the approximated kernel and 69.8% for the Gaussian kernel. These are state-of-the-art results for this dataset [Salomon
et al., 2002; Graves and Schmidhuber, 2005].



6 Relationship to the Polynomial Kernel 

Like the Taylor feature representation of the Gaussian kernel, the standard polynomial kernel of degree $r$,

$$K(x,x') = (\langle x, x' \rangle + c)^r \tag{14}$$

corresponds to a feature space containing all monomials of degree at most $r$. More specifically, the features corresponding
to the kernel (14) can be written as:

$$\phi_{k,j}(x) = \sqrt{\binom{r}{k} c^{\,r-k}} \prod_{i=1}^{k} x_{j_i} \tag{15}$$

where, as in (9), $k = 0, \ldots, r$ and $j \in [d]^k$ enumerates over all selections of $k$ coordinates in $x$. The difference, relative
to the Taylor approximation to the Gaussian, is only in a per-example overall scaling factor based on $\|x\|$, and in a
different per-degree factor (which depends only on the degree $k$). This weighting by a degree-dependent factor should
not be taken lightly, as it affects regularization, which is key to SVM training: features scaled by a larger factor are
"cheaper" to use, compared to those scaled by a very small factor. Comparing the degree-dependent scaling in the two
feature representations ((9) and (15)), we observe that the higher degrees are scaled by a much smaller factor in the
Taylor features, owing to the rapidly decreasing dependence on $1/\sqrt{k!}$. This means that higher degree monomials are
comparatively much more expensive for use in the Taylor features, and that the learned predictor likely relies more on 
lower degree monomials. 
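The two per-degree factors can be tabulated directly. The Python sketch below assumes the Taylor factor $1/(\sigma^k \sqrt{k!})$ from (9) and, for the polynomial kernel, the factor $\sqrt{\binom{r}{k} c^{r-k}}$ that follows from the binomial expansion of (14); the per-example factor and the multinomial coefficients are common to both and are omitted. Function names are illustrative.

```python
import math

def taylor_degree_weight(k, sigma2):
    # Per-degree factor of the Taylor features: 1 / (sigma^k * sqrt(k!)).
    return 1.0 / math.sqrt(sigma2 ** k * math.factorial(k))

def polynomial_degree_weight(k, r, c):
    # Per-degree factor implied by expanding (<x, x'> + c)^r.
    return math.sqrt(math.comb(r, k) * c ** (r - k))

def relative_weights(weights):
    # Normalize by the degree-0 weight so the two schemes are comparable.
    return [w / weights[0] for w in weights]
```

With $\sigma^2 = 1$, $c = 1$ and $r = 4$, the relative weight of the degree-4 monomials is $1/\sqrt{24} \approx 0.2$ for the Taylor features but $1$ for the polynomial features, which illustrates how much more heavily the Taylor representation penalizes high-degree terms.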

Nevertheless, the space of allowed predictors is nearly the same with both types of features, raising the question of 
how strong the actual effect of the different per-degree weighting is. The fact that all of the features in the Taylor 
representation are scaled by a factor depending on $\|x\|$ should make little difference on many datasets, as it affects
all of the features of a given example equally. Likewise, if most of the used features are of the same degree, then we
could perhaps correct for the degree-based scaling by changing the regularization parameter. The problem, of course,
is searching for and selecting this parameter. 

We checked if we could find a substantial difference in performance between the Taylor and standard polynomial 
features. Because the dependence on the regularization parameter necessitated a search over the parameter space, we 
conducted a rough experiment in which we tried different parameters, and compared the best error achieved on the
test set using the true Gaussian kernel, a Taylor approximation, and a standard polynomial kernel of the same degree.
The results are summarized in Table 2. These experiments indicate that the standard polynomial features might be
sufficient for approximating the Gaussian. Still, the Taylor features are just as easy to compute and use, and have 
the advantage that they use the same parameters as the Gaussian kernel. Hence, if we already have a sense of good 
bandwidth and C parameters for the Gaussian kernel, we can use the same values for the Taylor approximation. 



7 Summary 

The use of explicit monomial features of the form of (15) has been discussed recently as a way of speeding up training
with the polynomial kernel [Sonnenburg and Franc, 2010; Chang et al., 2010]. Our analysis and experiments indicate
that a similar monomial representation is also suitable for approximating the Gaussian kernel. We argue that such
features might often be preferable to the random Fourier features recently suggested by Rahimi and Recht [2007].
This is especially true on sparse datasets with a moderate number (up to several dozen) of non-zero dimensions per
data point.

Although we have only focused on binary classification, it is important to note that this explicit feature representation
can be used anywhere else $\ell_2$ regularization is used. This includes multiclass, structured and latent SVMs. The
use of such feature expansions might be particularly beneficial to structured SVMs, since these problems are hard to 
solve with only a kernel representation. 